Cluster Stopping Rules For Word Sense Discrimination

Authors

  • Guergana Savova
  • Terry Therneau
  • Christopher G. Chute
Abstract

As text data becomes plentiful, unsupervised methods for Word Sense Disambiguation (WSD) become more viable. A problem encountered in applying WSD methods is finding the exact number of senses an ambiguity has in a training corpus collected in an automated manner. That number is not known a priori; rather, it needs to be determined from the data itself. We address that problem using cluster stopping methods. Such techniques have not previously been applied to WSD. We implement the methods of Calinski and Harabasz (1975) and Hartigan (1975) and our adaptation of the Gap statistic (Tibshirani, Walther and Hastie, 2001). For evaluation, we use the WSD Test Set from the National Library of Medicine, whose sense inventory is the Unified Medical Language System. The best accuracy for selecting the correct number of clusters is 0.60, obtained with the C&H method. Our error analysis shows that the cluster stopping methods make finer-grained sense distinctions by creating additional clusters. The highest F-scores (82.89), indicative of the quality of cluster membership assignment, are comparable to the baseline majority sense (82.63) and point to a path towards accuracy improvement via additional cluster pruning. The significance of the current work lies in applying cluster stopping rules to WSD.
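To make the idea of a cluster stopping rule concrete, the following is a minimal sketch of the Calinski and Harabasz (C&H) index, which scores a partition by the ratio of between-cluster to within-cluster dispersion and prefers the number of clusters that maximizes it. The function names (`calinski_harabasz`, `pick_k`), the toy 2-D "context vectors", and the hand-made candidate partitions are illustrative assumptions, not the authors' implementation or data.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def _centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def _sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points: List[Point], labels: List[int]) -> float:
    """C&H index: [B / (k - 1)] / [W / (n - k)] for one partition.

    B is the between-cluster sum of squares, W the within-cluster sum.
    (Hypothetical helper written for this illustration.)
    """
    n = len(points)
    clusters: Dict[int, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("index is defined only for 2 <= k < n")

    overall = _centroid(list(points))
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = _centroid(members)
        between += len(members) * _sq_dist(c, overall)
        within += sum(_sq_dist(p, c) for p in members)

    if within == 0.0:
        return float("inf")  # perfectly tight clusters
    return (between / (k - 1)) / (within / (n - k))


def pick_k(points: List[Point], partitions: Dict[int, List[int]]) -> int:
    """Return the candidate k whose partition has the largest C&H index."""
    return max(partitions, key=lambda k: calinski_harabasz(points, partitions[k]))


# Toy data (assumed): three well-separated groups of two contexts each.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0), (10, 1)]

# Hand-made candidate partitions for k = 2, 3, 4 (assumed; in practice
# these would come from running a clustering algorithm at each k).
partitions = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 1, 2, 3],
}

best_k = pick_k(points, partitions)
```

On this toy input the three-cluster partition scores highest, so `best_k` is 3. The four-cluster candidate, which splits one true group in two, scores lower; this is the kind of over-splitting the error analysis above attributes to the stopping methods on real data.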

Similar resources

Automatic Cluster Stopping with Criterion Functions and the Gap Statistic

SenseClusters is a freely available system that clusters similar contexts. It can be applied to a wide range of problems, although here we focus on word sense and name discrimination. It supports several different measures for automatically determining the number of clusters in which a collection of contexts should be grouped. These can be used to discover the number of senses in which a word i...

Unsupervised Discrimination and Labeling of Ambiguous Names

This paper describes adaptations of unsupervised word sense discrimination techniques to the problem of name discrimination. These methods cluster the contexts containing an ambiguous name, such that each cluster refers to a unique underlying person or place. We also present new techniques to assign meaningful labels to the discovered clusters.

Finding the Optimal Number of Clusters for Word Sense Disambiguation

Ambiguity is an inherent problem for many tasks in Natural Language Processing. Unsupervised and semi-supervised approaches to ambiguity resolution are appealing as they lower the cost of manual labour. Typically, those methods struggle with estimation of number of senses without supervision. This paper shows research on using stopping functions applied to clustering algorithms for estimation o...

Improving Word Sense Discrimination with Gloss Augmented Feature Vectors

This paper presents a method of unsupervised word sense discrimination that augments co–occurrence feature vectors derived from raw untagged corpora with information from the glosses found in a machine readable dictionary. Each content word that occurs in the context of a target word to be discriminated is represented by a co-occurrence feature vector. Each of these vectors is augmented with th...

I2R: Three Systems for Word Sense Discrimination, Chinese Word Sense Disambiguation, and English Word Sense Disambiguation

This paper describes the implementation of our three systems at SemEval-2007, for task 2 (word sense discrimination), task 5 (Chinese word sense disambiguation), and the first subtask in task 17 (English word sense disambiguation). For task 2, we applied a cluster validation method to estimate the number of senses of a target word in untagged data, and then grouped the instances of this target ...

Publication date: 2006